Learning from Children: Improving Image-Caption Pretraining via Curriculum
Image-caption pretraining has been used successfully for downstream
vision tasks such as zero-shot image classification and object detection.
However, image-caption pretraining remains a hard problem -- it requires
multiple concepts (nouns) from captions to be aligned to several objects in
images. To tackle this problem, we go to the roots -- the best learners,
children. We take
inspiration from cognitive science studies dealing with children's language
learning to propose a curriculum learning framework. The learning begins with
easy-to-align image caption pairs containing one concept per caption. The
difficulty is progressively increased with each new phase by adding one more
concept per caption. Correspondingly, the knowledge acquired in each learning
phase is utilized in subsequent phases to effectively constrain the learning
problem to aligning one new concept-object pair in each phase. We show that
this learning strategy improves over vanilla image-caption training in various
settings -- pretraining from scratch, using a pretrained image and/or
pretrained text encoder, a low-data regime, etc.
Comment: ACL Findings 202
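
As a concrete illustration of the phase-based curriculum described in this
abstract, the following minimal Python sketch shows one way such a schedule
could be organized. All names here (CaptionPair, train_one_phase,
curriculum_train) are hypothetical stand-ins, not the authors' code, and the
per-phase training step is stubbed out.

    # Hypothetical sketch of the phase-based curriculum; not the authors' code.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class CaptionPair:
        image_path: str
        caption: str
        num_concepts: int  # number of noun concepts mentioned in the caption

    def train_one_phase(model, data: List[CaptionPair]) -> None:
        # Placeholder for one phase of image-caption alignment training,
        # e.g., a CLIP-style contrastive objective over `data`.
        pass

    def curriculum_train(model, pairs: List[CaptionPair], max_concepts: int = 4) -> None:
        # Phase k trains only on captions containing exactly k concepts, so
        # alignments learned in earlier phases constrain each new phase to
        # aligning roughly one new concept-object pair.
        for k in range(1, max_concepts + 1):
            phase_data = [p for p in pairs if p.num_concepts == k]
            train_one_phase(model, phase_data)  # weights carry over across phases
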
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
The field of vision-and-language (VL) understanding has made unprecedented
progress with end-to-end large pre-trained VL models (VLMs). However, they
still fall short in zero-shot reasoning tasks that require multi-step
inference. To address this, previous works resort to a
divide-and-conquer pipeline. In this paper, we argue that these efforts have
several inherent shortcomings: 1) They rely on domain-specific sub-question
decomposing models. 2) They force models to predict the final answer even if
the sub-questions or sub-answers provide insufficient information. We address
these limitations via IdealGPT, a framework that iteratively decomposes VL
reasoning using large language models (LLMs). Specifically, IdealGPT utilizes
an LLM to generate sub-questions, a VLM to provide corresponding sub-answers,
and another LLM to reason to achieve the final answer. These three modules
perform the divide-and-conquer procedure iteratively until the model is
confident about the final answer to the main question. We evaluate IdealGPT on
multiple challenging VL reasoning tasks under a zero-shot setting. In
particular, our IdealGPT outperforms the best existing GPT-4-like models by an
absolute 10% on VCR and 15% on SNLI-VE. Code is available at
https://github.com/Hxyou/IdealGPT
Comment: 13 pages, 5 figures
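
To make the iterative divide-and-conquer loop concrete, here is a minimal
Python sketch of the three-module procedure the abstract describes. The
callables decompose, answer_sub, and reason are assumed interfaces standing
in for the LLM and VLM calls; this is an illustration, not IdealGPT's actual
API.

    # Hypothetical sketch of IdealGPT's iterative loop; interfaces are assumed.
    from typing import Callable, List, Tuple

    Evidence = List[Tuple[str, str]]  # accumulated (sub-question, sub-answer) pairs

    def ideal_gpt_loop(
        question: str,
        image: object,
        decompose: Callable[[str, Evidence], List[str]],      # LLM: propose sub-questions
        answer_sub: Callable[[object, str], str],             # VLM: answer one sub-question
        reason: Callable[[str, Evidence], Tuple[str, bool]],  # LLM: (answer, confident?)
        max_rounds: int = 3,
    ) -> str:
        evidence: Evidence = []
        answer = ""
        for _ in range(max_rounds):
            # Divide: the LLM proposes sub-questions given the evidence so far.
            for sq in decompose(question, evidence):
                # Conquer: the VLM grounds each sub-question in the image.
                evidence.append((sq, answer_sub(image, sq)))
            # Reason: another LLM call attempts the main question.
            answer, confident = reason(question, evidence)
            if confident:
                break  # stop iterating once the reasoner is confident
        return answer
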
Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World
Understanding how events described or shown in multimedia content relate to
one another is a critical component of developing robust artificially
intelligent systems which can reason about real-world media. While much
research has been devoted to event understanding in the text, image, and video
domains, none has explored the complex relations that events exhibit across
domains. For example, a news article may describe a `protest' event while a
video shows an `arrest' event. Recognizing that the visual `arrest' event is a
subevent of the broader `protest' event is a challenging yet important problem
that prior work has not explored. In this paper, we propose the novel task of
MultiModal Event-Event Relations to recognize such cross-modal event relations.
We contribute a large-scale dataset consisting of 100k video-news article
pairs, as well as a benchmark of densely annotated data. We also propose a
weakly supervised multimodal method which integrates commonsense knowledge from
an external knowledge base (KB) to predict rich multimodal event hierarchies.
Experiments show that our model outperforms a number of competitive baselines
on our proposed benchmark. We also perform a detailed analysis of our model's
performance and suggest directions for future research.
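
To illustrate the task setup, the following small Python sketch scores one
cross-modal event pair with help from a commonsense prior, mirroring the
`protest'/`arrest' example above. Event, kb_prior, and score_relation are
illustrative names, and the fixed numbers are placeholders for learned
scores; this is not the paper's model.

    # Hypothetical illustration of cross-modal event-relation prediction.
    from dataclasses import dataclass

    @dataclass
    class Event:
        label: str     # e.g., "protest" (text) or "arrest" (video)
        modality: str  # "text" or "video"

    def kb_prior(parent: str, child: str) -> float:
        # Stand-in for an external commonsense KB lookup, e.g., "an arrest
        # is often part of a protest"; a real system would query the KB.
        known_subevents = {("protest", "arrest"): 0.9}
        return known_subevents.get((parent, child), 0.1)

    def score_relation(text_event: Event, video_event: Event) -> str:
        # Combine a (stubbed) learned cross-modal score with the KB prior;
        # the paper's weakly supervised model would replace this heuristic.
        model_score = 0.5  # placeholder for a learned alignment score
        combined = 0.5 * model_score + 0.5 * kb_prior(text_event.label, video_event.label)
        return "subevent" if combined > 0.5 else "no_relation"

    # The abstract's example: a news article's `protest' vs. a video's `arrest'.
    print(score_relation(Event("protest", "text"), Event("arrest", "video")))  # subevent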